Automatic detection and correction of annotation errors in Polish language corpora


Project factsheet

English name: Automatic detection and correction of annotation errors in Polish language corpora
Polish name: Automatyczne wykrywanie i korekcja błędów anotacyjnych w polskich korpusach językowych
Project type: A National Science Centre research grant (number 2011/01/N/ST6/01107)
Duration: 21 December 2011 ‒ 20 December 2013
Principal investigator: Łukasz Kobyliński
Institution: Institute of Computer Science, Polish Academy of Sciences

Project summary

The main goals of the project are: to improve existing methods for the automated detection of annotation errors in text corpora (at the morpho-syntactic level), to develop an accurate method of such error detection for Polish language resources, and to provide an efficient tool that can be used to automatically correct tagging errors in English and Polish corpora. The quality of the low-level (morpho-syntactic) corpus annotation is crucial, because this annotation is itself used to train automated taggers. Often a gold-standard subcorpus is selected from a larger collection of documents and serves as the training material for taggers, which are then used to annotate the complete corpus. The precision of the annotation in such a subcorpus influences the tagging quality of the entire corpus and thus has a direct impact on the accuracy of other, higher levels of text processing, e.g. semantic layers of annotation.